Abstract


We outline a multiprocessor architecture that uses modular arithmetic to
implement numerical computation with 900 bits of intermediate precision. A proposed
prototype, to be implemented with off-the-shelf parts, will perform high-precision
arithmetic as fast as some workstations and mini-computers can perform IEEE
double-precision arithmetic. We discuss how the structure of modular arithmetic
conveniently maps into a simple, pipelined multiprocessor architecture. We present
techniques we developed to overcome a few classical drawbacks of modular
arithmetic. Our architecture is suitable for, and essential to, the study of chaotic
dynamical systems.

1. Introduction


We have designed and functionally simulated a multiprocessor architecture
which uses modular arithmetic to implement fast, extremely precise arithmetic
operations. The structure of modular arithmetic exhibits immense parallelism, allowing an
implementation of high-precision fixed-point arithmetic that is comparable in speed to
the IEEE double-precision arithmetic (64 bits) provided by some of the floating-point
units in workstations and mini-computers. By implementing multiple fixed-point
number systems on top of a modular number system with 18 moduli ranging from
2^[illegible] - 1 to 2^64, we obtain over 900 bits of intermediate precision.

High-precision computation is essential in numerical studies of chaotic
systems. The behaviour of these systems is extremely sensitive to their initial
conditions, rendering numerical simulation with low-precision arithmetic extremely difficult
and frequently useless. Chaos theory, combined with precise numerical simulation,
has been applied in orbital mechanics [Sussman 88], and work is in progress on using
computation and chaotic phenomena in physical systems. These applications require
tremendous amounts of high-precision computation, which our architecture will
effectively provide. We also believe that this architecture can be adapted to perform
efficient symbolic algebra and cryptography.

The idea of using the modular number system to speed up computer
arithmetic is not new. Extensive work was done in the 1960s to investigate its viability
[Szabo 67]. Even more effort was spent on digital signal processing applications
[Soderstrand 86]. However, owing to difficulties in performing division, sign
detection, and magnitude comparison in this representation, modular arithmetic is seldom
used in general-purpose computer arithmetic.

Our approach of using medium-sized (≤ 64 bit) binary numbers to support
a 900 bit modular arithmetic system, which in turn implements high-precision
fixed-point arithmetic, allowed us to overcome some of the problems associated with
modular arithmetic. The approximate magnitudes of the 900 bit numbers are tracked
by a conventional floating-point unit. This information can be used to reduce the
number of normalizations. The floating-point estimates produce initial guesses for a
Newton-Raphson division routine, and are used in rough magnitude comparisons.


We start with a brief overview of modular arithmetic and how it is used to
implement efficient fixed-point arithmetic. We discuss how we avoid some intrinsic
pitfalls of modular arithmetic, and how modular arithmetic can be implemented on
pipelined, parallel hardware.

2.  Modular  Arithmetic 

The modular arithmetic system is also known as the residue number system
(RNS). A number is represented by the remainders (digits) formed when it is divided
by a set of pairwise relatively prime numbers (moduli). For example, the integer $17_{10}$
is represented in an RNS with moduli {2, 3, 5, 7} by the digits {1, 2, 2, 3}. We refer to
an RNS number as a modnum.
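The conversion in the example above can be checked with a minimal Python sketch. The function name `to_rns` is ours, not the paper's, and the moduli are the toy values from the example rather than the prototype's.

```python
from itertools import combinations
from math import gcd

def to_rns(value, moduli):
    """Return the residue digits of `value`, one per modulus (hypothetical helper)."""
    # The representation is unique only if the moduli are pairwise relatively prime.
    assert all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    return [value % m for m in moduli]

print(to_rns(17, [2, 3, 5, 7]))  # [1, 2, 2, 3]
```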

2.1  Modular  Arithmetic  Operations 

An RNS consisting of relatively prime moduli with product $M$ can be used
to represent signed integers in $[-M/2, (M/2) - 1]$.¹ Three basic arithmetic operations
(add, subtract, and multiply) on modnums can be implemented as digit-wise modular
operations. So, in an RNS with moduli $\{m_{n-1}, \ldots, m_1, m_0\}$,

\[ \left| \{x_{n-1}, \ldots, x_1, x_0\} \;\mathrm{op}\; \{y_{n-1}, \ldots, y_1, y_0\} \right|_M \]

where $|x|_m$ denotes $x \bmod m$, and op is $+$, $-$, or $\times$, yields

\[ \{ |x_{n-1} \;\mathrm{op}\; y_{n-1}|_{m_{n-1}}, \ldots, |x_1 \;\mathrm{op}\; y_1|_{m_1}, |x_0 \;\mathrm{op}\; y_0|_{m_0} \}. \]
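As an illustration of the digit-wise rule, the following Python sketch applies an operation independently at each digit position and checks the result by reconstructing the integer with the Chinese Remainder Theorem. The toy moduli and all function names are assumptions made for this example; the reconstruction is used only for checking and is not part of the architecture described here.

```python
import operator
from math import prod

MODULI = [2, 3, 5, 7]   # toy moduli (assumed); the prototype uses 18 much larger ones
M = prod(MODULI)        # 210, so signed values in [-105, 104] are representable

def to_rns(value):
    return [value % m for m in MODULI]

def digitwise(op, xs, ys):
    """Apply op at each digit position; nothing is carried between digits."""
    return [op(x, y) % m for x, y, m in zip(xs, ys, MODULI)]

def from_rns(digits):
    """Chinese Remainder Theorem reconstruction into [-M/2, M/2 - 1]."""
    total = 0
    for d, m in zip(digits, MODULI):
        rest = M // m
        total += d * rest * pow(rest, -1, m)   # modular inverse, Python 3.8+
    total %= M
    return total - M if total >= M // 2 else total

product = digitwise(operator.mul, to_rns(9), to_rns(-11))
print(from_rns(product))  # -99
```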

Digit-wise addition or subtraction modulo an RNS modulus (written as
$\oplus_m$ and $\ominus_m$) is easy because the "carry" is guaranteed to be less than the
modulus. The remainder of the result can be computed in at most one more subtraction
(called carry-adjust). So, if $x$ and $y$ are residues mod $m$,

\[ x \oplus_m y = \begin{cases} x + y & \text{if } x + y < m; \\ x + y - m & \text{if } x + y \ge m. \end{cases} \]
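A minimal Python sketch of carry-adjusted digit addition and subtraction follows. The function names are ours; the point is only that a single conditional correction suffices when both operands are already reduced.

```python
def add_mod(x, y, m):
    """x + y modulo m for digits 0 <= x, y < m, with one conditional subtraction."""
    s = x + y                   # at most 2m - 2, so one adjustment is enough
    return s - m if s >= m else s

def sub_mod(x, y, m):
    """x - y modulo m for digits 0 <= x, y < m, with one conditional addition."""
    d = x - y                   # at least -(m - 1)
    return d + m if d < 0 else d
```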

Digit-wise modular multiplication ($\otimes_m$) requires a full remaindering operation.
Efficient hardware implementation is possible if the moduli are restricted to numbers
of the forms $2^p + 1$, $2^p$, and $2^p - 1$. Using the casting out nines algorithm [Knuth 69],

\[ |X|_{2^p} = X \bmod 2^p, \]
\[ |X|_{2^p - 1} = (X \bmod 2^p) \oplus_{2^p - 1} (X \operatorname{div} 2^p), \text{ and} \]
\[ |X|_{2^p + 1} = (X \bmod 2^p) \ominus_{2^p + 1} (X \operatorname{div} 2^p), \]

involving only bit extraction and simple digit-wise modular operations. Choosing
moduli of these forms also facilitates carry detection in modular addition and
subtraction.

¹ As in two's-complement, a negative number $X$ is represented as $M + X$.
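The three reductions can be sketched in Python as below. This is an illustration under our own assumptions (the function name, the string tags for the three forms, and the restriction to double-length inputs $X < 2^{2p}$), not the hardware design. It relies on $2^p \equiv 1 \pmod{2^p - 1}$ and $2^p \equiv -1 \pmod{2^p + 1}$.

```python
def reduce_special(X, p, form):
    """Reduce 0 <= X < 2**(2*p) modulo 2**p, 2**p - 1, or 2**p + 1 without dividing."""
    lo = X & ((1 << p) - 1)      # X mod 2**p: bit extraction
    hi = X >> p                  # X div 2**p: bit extraction
    if form == "2^p":
        return lo
    if form == "2^p-1":
        m = (1 << p) - 1
        s = lo + hi              # since 2**p is congruent to 1 modulo m
        while s >= m:            # at most two subtractions for this input range
            s -= m
        return s
    if form == "2^p+1":
        m = (1 << p) + 1
        d = lo - hi              # since 2**p is congruent to -1 modulo m
        return d + m if d < 0 else d
    raise ValueError(form)
```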

Therefore,  simple  modnum  operations  can  be  reduced  to  digit-wise  operations 
modulo  the  respective  moduli  with  no  information  carried  between  the  digits.  This 
lack  of  a  carry  chain  eliminates  the  inherent  sequentiality  when  operating  on  successive 
digits  in  a  weighted  number  system,  allowing  full  parallelism  in  digit-wise  operations. 
Since  digit-wise  operations  may  occur  concurrently,  it  is  possible  to  implement,  on 
parallel  hardware,  modulo  M  arithmetic  in  the  time  required  to  perform  modulo  m 
arithmetic.  This  makes  modular  arithmetic  an  attractive  platform  for  implementing 
high-precision,  long  word-length  arithmetic  on  a  multiprocessor. 

2.2  Modnum  Division 

Although generalized division in the residue number system is complicated
and ill-defined, the particular case of division by a product of any of the moduli
is possible. When a division has remainder zero, it is equivalent to multiplying the
dividend by the divisor's multiplicative inverse. Since each modulus is relatively prime
to all the other moduli, its multiplicative inverses modulo each of the other moduli
are defined.² Hence division by a product of moduli may be decomposed into a series
of multiplications of the dividend by the inverses of the divisor's factors, provided we
guarantee, at each step, that the remainder is zero. For numbers in modular form, the
modnum digit at each modulus predicts the remainder when the modnum is divided
by that modulus. Therefore, to truncate the original dividend or each intermediate
quotient to a multiple of the next divisor modulus, we simply subtract³ the value of
the modnum digit at the divisor modulus from the entire number.

Since each of the divisor's factors has no multiplicative inverse modulo itself,
the quotient we form is in an RNS of only the non-divisor moduli. Unique
representation is still guaranteed because the quotient has reduced range relative to the
dividend. The base extension process [Szabo 67] recovers the missing digits. The
algorithmic structure of this procedure resembles that of division.

² They can be computed by Euclid's GCD algorithm.

³ We can also round up by adding the additive inverse.

Each  step  in  modnum  division  starts  with  a  modnum  subtraction  and  a 
modnum  multiplication,  and  then  one  of  the  digits  is  broadcast  to  the  other  digit 
positions  for  the  next  subtraction.  The  number  of  steps  is  equal  to  the  number 
of  moduli.  Simultaneously,  the  algorithm  yields  digits  in  a  weighted,  mixed-radix 
representation,  to  allow  sign  detection  and  magnitude  comparison. 
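The division steps described in this section can be sketched in Python as follows. This is a sequential illustration under our own naming (`scale_down`, `divisor_idx`), with toy moduli in the usage example; it rounds down, omits base extension, and does not model the bus broadcast.

```python
def scale_down(digits, moduli, divisor_idx):
    """Divide a modnum by the product of moduli[j] for j in divisor_idx, rounding down.

    Returns (quotient_digits, remaining_moduli, mixed_radix_digits)."""
    digits = dict(enumerate(digits))   # position -> residue digit
    live = dict(enumerate(moduli))     # positions that still hold a valid digit
    mixed_radix = []
    for j in divisor_idx:
        r = digits.pop(j)              # the digit at m_j is the remainder mod m_j
        mj = live.pop(j)
        mixed_radix.append(r)          # r is "broadcast" to every other position
        for i, mi in live.items():
            # Subtract the remainder, then multiply by the inverse of m_j mod m_i.
            digits[i] = (digits[i] - r) * pow(mj, -1, mi) % mi
    return [digits[i] for i in live], list(live.values()), mixed_radix

moduli = [7, 9, 11, 13, 16]
q, rest, mr = scale_down([12345 % m for m in moduli], moduli, [0, 4])
print(q, rest, mr)  # [2, 0, 6] [9, 11, 13] [4, 3]
```

Here 12345 is divided by 7 × 16 = 112, giving 110, whose residues modulo 9, 11, and 13 are 2, 0, and 6. The discarded remainder 25 appears as the mixed-radix digits 4 + 3 × 7.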

The preceding section was meant merely as a reference for the following
sections. Complete and rigorous treatments of modular algorithms can be found in
[Szabo 67] and [Knuth 69].

3.  Implementing  Fixed-point  Arithmetic 

Modnum additions, subtractions, multiplications, and scaling by products
of moduli are used to implement fixed-point arithmetic. A fixed-point number $f$ is
represented by the modnum $n$, where $n = fR$. $R$ is a fixed-point radix chosen to be a
product of moduli. Each "tick" in this representation is $1/R$. Multiple radix points can
be supported simultaneously to ensure precise representation over a wide range.

Fixed-point addition (and hence subtraction) is simply

\[ f_1 + f_2 \rightarrow (f_1 + f_2) R = n_1 + n_2 \]

and therefore equivalent to a modnum addition. Fixed-point multiplication is

\[ f_1 \times f_2 \rightarrow (f_1 \times f_2) R = \frac{f_1 R \times f_2 R}{R} = \frac{n_1 \times n_2}{R}. \]

Division by $R$ "normalizes" the fixed-point number to its previous representation, at
the expense of some precision. Since $R$ is a product of moduli the division can be
done as previously described. In order to hold the quantity $(n_1 \times n_2)$ before scaling,
we need $M > f^2 R^2$, and so $f < \sqrt{M}/R$, to prevent overflow.
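The scaling relations can be illustrated with ordinary Python integers standing in for modnums. The radix value and the function names are assumptions for this sketch; in the architecture the division by $R$ would be the modnum scaling operation rather than an integer division.

```python
from fractions import Fraction

R = 7 * 9 * 16   # toy radix 1008: a product of three (assumed) moduli

def to_fixed(f):
    """n = f * R, truncated to an integer."""
    return int(Fraction(f) * R)

def fx_add(n1, n2):
    return n1 + n2            # (f1 + f2) R = n1 + n2, no scaling needed

def fx_mul(n1, n2):
    return (n1 * n2) // R     # n1 * n2 carries R**2, so normalize by R

a, b = to_fixed(Fraction(3, 2)), to_fixed(Fraction(1, 4))
print(Fraction(fx_mul(a, b), R))  # 3/8
```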
3.1 Minimizing Normalization Operations


Normalizing after each multiplication is expensive. The number of
normalizations can be reduced by delaying them until necessary. For example, when summing
a series of the form $f_r = f_{11} f_{12} + f_{21} f_{22} + \cdots + f_{n1} f_{n2}$ we can choose to sum the
intermediate products with the radix temporarily raised to $R^2$. Only one normalization is
needed for the entire multiply-accumulate operation.
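A small Python comparison of eager and delayed normalization, again with plain integers standing in for modnums and a toy radix of our choosing, shows both the saving in normalizations and the precision retained.

```python
R = 7 * 9 * 16   # toy radix 1008 (assumed)

def mac_eager(pairs):
    """Normalize after every product: one normalization per term."""
    return sum((a * b) // R for a, b in pairs)

def mac_delayed(pairs):
    """Accumulate at radix R**2, then normalize once."""
    return sum(a * b for a, b in pairs) // R

pairs = [(1512, 1345)] * 3
print(mac_eager(pairs), mac_delayed(pairs))  # 6051 6052
```

Each product is 2017.5 ticks; eager normalization truncates every term to 2017, while the delayed sum keeps the halves until the single final normalization.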

A major advantage is that computation proceeds before precision is lost
through normalization. We developed a scheme in which we track the approximate
value of our fixed-point numbers with floating-point numbers (flonums). Whenever we
perform a fixed-point operation, a corresponding flonum operation takes place, albeit
with less precision. An accurate copy of the floating-point approximation can be
constructed using the mixed-radix digits generated when normalizing products. The
flonum's magnitude can be used to signal the need to normalize and avoid overflow.

Another trick is to pre-scale common factors. For example, if the expression
$Y_i = X_i \times K$ appears within a loop, $K$ may be scaled once outside the loop, eliminating
the need to normalize within each iteration. If the approximate dynamic range and
inherent accuracy of relevant numbers are known, either a priori or through the
flonum approximation, pre-scaling can be handled with little or no loss in precision.

These  optimizations  can  be  statically  managed  by  the  programmer.  However, 
we  believe  that  automated  optimization  by  a  compiler  is  possible  [Dally  89].  The 
problems  of  computing  with  fixed-point  numbers  were  familiar  to  programmers  before 
floating-point  arithmetic  was  invented,  and  most  of  the  solutions  developed  would 
apply  here. 

3.2  Division,  Comparisons,  and  Sign-detection 

With fixed-point addition, subtraction, and multiplication, we can implement
fixed-point division (reciprocals) using the Newton-Raphson approximation method
[AMD 88]. Because this algorithm has quadratic convergence, we only need about
10 iterations to generate a 900 bit reciprocal even if we start with an initial guess
with one or two correct bits. Newton-Raphson approximations for other functions are
equally applicable.
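The iteration count can be checked with a Python sketch of the reciprocal recurrence $x \leftarrow x(2 - dx)$ on scaled integers. As a simplifying assumption, a power-of-two radix stands in for the product of moduli, Python integers stand in for modnums, and the initial guess (which in the architecture would come from the tracking flonum) is supplied by hand.

```python
BITS = 900
R = 1 << BITS    # assumption: power-of-two radix in place of a product of moduli

def fx_mul(a, b):
    return (a * b) // R    # fixed-point multiply followed by normalization

def reciprocal(d, x0, iterations=10):
    """Newton-Raphson reciprocal: x <- x * (2 - d * x), all values scaled by R."""
    x = x0
    for _ in range(iterations):
        x = fx_mul(x, 2 * R - fx_mul(d, x))
    return x

d = 3 * R // 2             # the value 1.5
x0 = R // 2                # crude guess 0.5 for 1/1.5, about two correct bits
x = reciprocal(d, x0)
print(abs(x - 2 * R // 3) <= 4)  # True: within a few ticks of 2/3
```

Since the relative error squares at each step, a guess with relative error 1/4 reaches an error near $2^{-2048}$ after ten steps, so only truncation in the last few ticks remains. After five steps the error is still about $2^{-64}$, far short of 900 bits.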


We can quickly compute an approximate answer to the function we are
computing by performing floating-point arithmetic on tracking flonums. The answer is
used to index into a precomputed table mapping flonums to modnums. A common
table may be shared for all approximation methods.

The flonum approximation also allows gross magnitude comparisons to be
done. Close calls are resolved by conversion of the modnum to the mixed-radix
notation.

We can efficiently detect positive numbers close to zero. Since we ensure
that our residue representation is non-redundant and that the moduli are pairwise
relatively prime, the only case in which all the digits are equal is when the modnum
$n = \{\ldots, n_1, n_0\}$ is less than the smallest modulus, $n < m_{\min}$.⁴ Since this
condition can be checked digit-wise, it can be done in parallel.⁵
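The equal-digits test is easy to verify exhaustively on a small system. The Python sketch below uses toy moduli and a function name of our choosing; in hardware each node would compare its own digit against the broadcast value.

```python
def all_digits_equal(digits):
    """True exactly when the represented non-negative value is below the smallest modulus."""
    return all(d == digits[0] for d in digits)

moduli = [3, 4, 5, 7]      # toy moduli with product 420 (assumed)
small = [n for n in range(420) if all_digits_equal([n % m for m in moduli])]
print(small)  # [0, 1, 2]
```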


4. The Modnum Parallel Architecture

A  block  diagram  of  the  Modnum  multiprocessor  is  shown  in  Figure  1. 


Figure 1: The Modnum Multiprocessor

The architecture specifies a number of digit nodes, each computing digit-wise
operations of one modulus. They communicate through a synchronous, shared bus.
Each node has modular arithmetic hardware, memory, and a sequencer, which
executes nanocode that is potentially different on each node. The digit-node sequencer
decodes microcode instructions fed from a central controller. In addition the
controller performs address computations and feeds the computed addresses to the digit
nodes. Also sitting on the shared bus is the tracking node, a floating-point unit that
snoops on the controller-supplied micro-instructions, memory addresses, and bus
communication. It has its own nanocode to exercise proper control of its floating-point
hardware.

⁴ Proof: It is "obvious" that the digits are in fact equal when $n$ is smaller than the
smallest modulus $m_{\min}$, and since each number is uniquely represented and the
representation is non-redundant, it follows that the digits are equal if and only if $n < m_{\min}$.

⁵ Assuming the accumulation of each digit's boolean result can be done at once, e.g., wire-
ANDed in hardware.

For the prototype, we will use 18 digit nodes, each with 32-bit datapaths
cycled twice to implement 64-bit arithmetic. Our chosen set of moduli ranges from
2^[illegible] - 1 to 2^64. These choices can be changed conveniently by redoing
field-programmable logic devices.

4.1  Digit  Node 


Each node computes with one digit of the modular representation. Each
digit of a modnum is stored on the corresponding node.

Figure 2: The Datapath Pipeline


The datapath pipeline, shown in Figure 2, is optimized to perform modular
addition and subtraction. A new nano-instruction commences every pipestage.
The instruction and the least-significant 32-bit words (lsw) of the modnum operands
are fetched in the first stage (Fetch). Binary addition on the lsw's happens during
Operation. The result lsw is compared with the modulus lsw in Carry-detect Logic.
Standard registered PALs perform this function while simultaneously latching the
result to maintain the pipeline. The next stage unconditionally generates a
carry-adjusted lsw. Meanwhile the most-significant words of the operands (msw) have been
fetched and have advanced to the Operation stage to yield the raw result msw. The full
carry-detection may then complete, just in time to select whether the raw-result lsw
or the carry-adjusted lsw gets written back to the register file. In the next nanocycle
the correct msw is similarly selected. The pipeline allows a new modnum addition or
subtraction to commence every 2 nanocycles.
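The two-pass scheme can be mimicked in Python as a functional sketch: the sum and the carry-adjusted candidate are both formed from 32-bit halves, least-significant word first, and the final select depends on whether the adjusted value went negative. This models only the arithmetic, not the pipeline timing or the PAL logic, and the function name is ours.

```python
MASK32 = 0xFFFFFFFF

def add_mod_64_via_32(x, y, m):
    """(x + y) mod m for 0 <= x, y < m <= 2**64, using 32-bit word operations."""
    # Pass 1: least-significant words.
    lo_sum = (x & MASK32) + (y & MASK32)
    lo = lo_sum & MASK32
    carry = lo_sum >> 32
    adj_lo = lo - (m & MASK32)            # candidate carry-adjusted lsw
    borrow = 1 if adj_lo < 0 else 0
    # Pass 2: most-significant words.
    hi = (x >> 32) + (y >> 32) + carry
    adj_hi = hi - (m >> 32) - borrow      # negative means raw sum was below m
    raw = (hi << 32) | lo
    adjusted = (adj_hi << 32) | (adj_lo & MASK32)
    # Select raw or carry-adjusted words once full carry detection is known.
    return adjusted if adj_hi >= 0 else raw
```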

A 32-bit CMOS multiplier performs 64-bit multiplication in 7 nanocycles.
The multiplier is connected to the arithmetic datapath (not shown in diagram) so as
to allow a new multiplication to commence every 4 nanocycles.

The entire datapath can be implemented with currently available, stock
TTL and CMOS parts. Standard binary arithmetic units, coupled with a few
programmable logic devices, efficiently perform modular arithmetic. Each pipestage can
comfortably be completed in 80ns with the parts we have chosen.

4.2  Instruction  Sequencing 

The central controller supplies microcode to cause nanocode routines to be
executed on each node. This approach combines the benefits of SIMD and MIMD
architectures. Different nodes may have different nanocode. For example, the tracking
node runs nanocode that is quite different from that on digit nodes, and we can
program some digit nodes to perform two 32-bit operations with short moduli in the
time it takes other nodes to complete one 64-bit operation. Fixed-point operations
are sequenced as microcode instructions, while the modnum operations implementing
them are programmed in nanocode. Synchronization is ensured either by carefully
generated nanocode sequences of known length or by wire-ANDed status lines.

4.3  Communication 

To perform scaling, one selected digit of each step's result is broadcast to all
other nodes. This is done sequentially on the shared bus. Since ownership of the bus
is predetermined by the scaling algorithm, it can be software controlled and requires
no explicit arbitration. The controller can also access memory on a selected node and
grant it ownership of the bus. The tracking node may also own the bus, e.g. when it
has to broadcast a Newton-Raphson initial guess.

5.  Performance  Estimate 

As discussed above, modnum (and hence fixed-point) additions and
subtractions can start every two 80ns nanocycles, so the peak execution rate for these
instructions is 6¼ million operations per second (MOPS). Primitive multiplies
(without normalization) start every four nanocycles, to yield 3⅛ MOPS peak. The average
speed of the computer will depend on the number of normalizations required by the
application program and whether operations may be effectively pipelined.
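The peak rates follow directly from the issue intervals. A short Python check, with a function name of our choosing:

```python
NANOCYCLE_NS = 80   # pipestage time stated for the chosen parts

def peak_mops(nanocycles_per_op):
    """Peak rate in millions of operations per second for a given issue interval."""
    return 1e3 / (nanocycles_per_op * NANOCYCLE_NS)

print(peak_mops(2), peak_mops(4))  # 6.25 3.125
```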

6.  Conclusions 

We designed a multiprocessor architecture to perform high-precision
arithmetic very efficiently. We used the modular arithmetic representation, and developed
the novel method of floating-point tracking to avoid some of its inherent pitfalls. Our
architecture is suitable for implementation with currently available hardware, and the
resulting system will provide high-precision arithmetic with performance comparable
to common double-precision floating-point systems. Our system has immediate
applications in the study of chaotic systems. We expect further applications to develop
in symbolic algebra and cryptography.

7.  Future  and  Relevant  Work 

We plan to have a prototype of this architecture implemented and tested by
September, 1989. At that point we wish to conduct performance measurements on
this architecture. We will continue to work on optimization techniques, and look into
formal studies of roundoff errors in our fixed-point arithmetic.

Although this architecture is expected to deliver satisfactory performance
for our initial applications, its implementation with off-the-shelf parts requires much
hardware. By implementing each digit board (sans memory) in a single VLSI chip, it
may be possible to implement an entire modnum machine on a single printed-circuit
board. This board will be attractive as a node in a multiprocessor, or as an accelerator
board for a conventional computer.

The  design  of  this  architecture  is  just  a  means  to  an  end.  The  primary  goal 
of  the  project  is  to  perform  studies  on  chaotic  systems.  Not  only  will  this  computer  be 
used  to  perform  detailed  numerical  studies  of  chaotic  systems,  it  will  also  be  valuable 
for  calibrating  new  simulation  techniques  using  conventional  arithmetic,  and  may 
have  some  impact  on  the  field  of  numerical  analysis. 

Number representations using continued fractions (and continued logarithms)
also promise to provide efficient arbitrary-precision arithmetic [Gosper PC]. Most
recently, [Vuillemin 88] reports some breakthroughs in that area. The Schönhage-Strassen
FFT multiplication algorithm is efficient for computing long word-length products.
Its implementation is probably most suited for massively parallel machines (such as
the Connection Machine) and word lengths a few orders of magnitude greater than
those we plan to implement.